IBM HR Analytics Employee Attrition and Performance Dataset

In this study, we analyze HR data available from kaggle.com.

This data is fictional and was created by IBM data scientists.

Categorical Parameters:

Parameter                   1               2         3           4             5
Education                   Below College   College   Bachelor    Master        Doctor
Environment Satisfaction    Low             Medium    High        Very High     -
Job Involvement             Low             Medium    High        Very High     -
Job Satisfaction            Low             Medium    High        Very High     -
Performance Rating          Low             Good      Excellent   Outstanding   -
Relationship Satisfaction   Low             Medium    High        Very High     -
WorkLife Balance            Bad             Good      Better      Best          -

This can be encoded as follows,
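
A minimal sketch of such an encoding is given here, since the original code cell is not shown. The dictionary mirrors the table above. The column names (for example `EnvironmentSatisfaction`) are assumed to match the CSV headers, and `decode_categories` is a hypothetical helper introduced only for illustration.

```python
import pandas as pd

# Integer code -> label, copied from the table of categorical parameters.
# Column names are an assumption about how the CSV headers are spelled.
SATISFACTION = {1: "Low", 2: "Medium", 3: "High", 4: "Very High"}
CATEGORY_LABELS = {
    "Education": {1: "Below College", 2: "College", 3: "Bachelor",
                  4: "Master", 5: "Doctor"},
    "EnvironmentSatisfaction": SATISFACTION,
    "JobInvolvement": SATISFACTION,
    "JobSatisfaction": SATISFACTION,
    "PerformanceRating": {1: "Low", 2: "Good", 3: "Excellent", 4: "Outstanding"},
    "RelationshipSatisfaction": SATISFACTION,
    "WorkLifeBalance": {1: "Bad", 2: "Good", 3: "Better", 4: "Best"},
}


def decode_categories(df):
    """Return a copy of df with integer codes replaced by their labels."""
    out = df.copy()
    for column, labels in CATEGORY_LABELS.items():
        if column in out.columns:  # skip columns that are not present
            out[column] = out[column].map(labels)
    return out


# Toy input, not taken from the real dataset.
sample = pd.DataFrame({"Education": [1, 5], "WorkLifeBalance": [2, 4]})
decoded = decode_categories(sample)
print(decoded)
```

Keeping the integer codes in the modeling data and using the labels only for plots and tables is one reasonable choice, because the codes already preserve the ordering of the levels.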

Loading the Dataset

First off, let's take a look at the dataset
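
A minimal sketch of the loading step with pandas is shown below. The actual file name is not given in the text, so the example reads a tiny in-memory stand-in. Its rows are made up for illustration, and `load_dataset` is a hypothetical helper.

```python
import io

import pandas as pd


def load_dataset(source):
    """Read the HR data from a file path or file-like object."""
    return pd.read_csv(source)


# In-memory stand-in for the real CSV; these rows are illustrative only.
demo_csv = io.StringIO(
    "Age,Attrition,Department\n"
    "35,Yes,Sales\n"
    "42,No,Human Resources\n"
)

df = load_dataset(demo_csv)
print(df.head())   # first rows of the table
print(df.shape)    # (number of rows, number of columns)
print(df.dtypes)   # which columns are numeric and which are text
```

With the real file, the same function would be called with the path to the downloaded CSV instead of `demo_csv`.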

Exploratory Data Analysis

Age

Business Travel

Daily Rate

Department

Distance from Home

Education Field

Hourly Rate

Job Involvement

Job Roles

Job Satisfaction

Marital Status

Monthly Income

Monthly Rate

Number of Companies Worked

Over Time

Percent Salary Hike

Performance Rating

Relationship Satisfaction

Stock Option Level

Total Working Years

Training Times Last Year

Work-Life Balance Score

Years at the Company

Years In Current Role

Years Since Last Promotion

Years With Current Manager

Problem Description

In the dataset, Attrition indicates whether an employee has left the company. We would like to build a predictive model for this variable.

We need to convert categorical data to numeric data.

We can use LabelEncoder to convert the categorical values to numeric ones.
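
The original code for this step is not shown, so the following is only a sketch of how LabelEncoder from scikit-learn could be applied to every non-numeric column. `encode_text_columns` is a hypothetical helper, and the toy frame is made up for illustration.

```python
import pandas as pd
from pandas.api.types import is_numeric_dtype
from sklearn.preprocessing import LabelEncoder


def encode_text_columns(df):
    """Label-encode every non-numeric column; return the new frame and the encoders."""
    out = df.copy()
    encoders = {}
    for column in out.columns:
        if is_numeric_dtype(out[column]):
            continue  # numeric columns are left unchanged
        encoder = LabelEncoder()
        out[column] = encoder.fit_transform(out[column])
        encoders[column] = encoder  # kept so the codes can be mapped back later
    return out, encoders


# Toy input, not taken from the real dataset.
toy = pd.DataFrame({"Attrition": ["Yes", "No", "No"], "Age": [30, 40, 50]})
encoded, encoders = encode_text_columns(toy)
print(encoded)
print(list(encoders["Attrition"].classes_))
```

LabelEncoder assigns integers in sorted order of the class labels, so the resulting codes carry no meaningful ranking for nominal columns such as a department name. One-hot encoding is a common alternative when that matters.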

Variance of the Features

Features with variance zero

First, we remove features that have zero variance, since a feature that takes the same value for every employee adds nothing to our modeling.
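
One way to do this, sketched under the assumption that the data is held in a pandas DataFrame, is to drop every column that contains a single distinct value. `drop_constant_columns` and the column names in the toy frame are hypothetical.

```python
import pandas as pd


def drop_constant_columns(df):
    """Drop columns with a single distinct value; return the new frame and the dropped names."""
    constant = [c for c in df.columns if df[c].nunique(dropna=False) <= 1]
    return df.drop(columns=constant), constant


# Toy input: "ConstantFlag" has the same value in every row.
toy = pd.DataFrame({
    "Age": [30, 40, 50],
    "ConstantFlag": [1, 1, 1],
    "MonthlyIncome": [2000, 4000, 6000],
})
reduced, dropped = drop_constant_columns(toy)
print(dropped)
print(reduced.columns.tolist())
```

Counting distinct values works for text columns as well as numeric ones, which a direct variance computation would not.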

X and y sets

Features with high variance

Moreover, high variance in some features can hurt our modeling process. For this reason, we standardize the features by removing the mean and scaling to unit variance. In this article, we demonstrated the benefits of scaling data using StandardScaler().
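
The following sketch shows the split into X and y followed by standardization with StandardScaler. The toy values and the choice of `Attrition` as an already encoded target column are assumptions made for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Toy input, not taken from the real dataset; Attrition is assumed to be encoded as 0/1.
toy = pd.DataFrame({
    "MonthlyIncome": [2000.0, 4000.0, 6000.0],
    "Age": [30.0, 40.0, 50.0],
    "Attrition": [1, 0, 0],
})

X = toy.drop(columns="Attrition")  # features
y = toy["Attrition"]               # target, left unscaled

scaler = StandardScaler()
# fit_transform returns a NumPy array, so the column names are restored by hand.
X_scaled = pd.DataFrame(scaler.fit_transform(X), columns=X.columns, index=X.index)

print(X_scaled)
print(np.allclose(X_scaled.mean(), 0.0))       # each column now has mean 0
print(np.allclose(X_scaled.std(ddof=0), 1.0))  # and unit (population) variance
```

When the data is later divided into training and test sets, fitting the scaler on the training part only and reusing it on the test part avoids leaking test statistics into the model.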

Modifying the Dataset

Saving to a CSV
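
A minimal sketch of the saving step is given below. The output file name is hypothetical, and a temporary directory is used so that the example can run anywhere.

```python
import os
import tempfile

import pandas as pd


def save_dataset(df, path):
    """Write df to a CSV file without the index column."""
    df.to_csv(path, index=False)


# Toy input, not taken from the real dataset.
toy = pd.DataFrame({"Age": [30, 40, 50], "Attrition": [1, 0, 0]})

# Hypothetical file name inside a temporary directory.
path = os.path.join(tempfile.mkdtemp(), "hr_data_modified.csv")
save_dataset(toy, path)

reloaded = pd.read_csv(path)
print(reloaded.equals(toy))
```

Passing `index=False` keeps the row index out of the file, so reading it back does not produce an extra unnamed column.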


References

  1. Kaggle Dataset: IBM HR Analytics Employee Attrition & Performance
  2. Getting Started with Plotly in Python